
Internet Archive's S3-like Server API

Last Updated: $Date: 2011-10-06 +0000 (Thu, 06 Oct 2011) $

NOTA BENE

This document is very much a work in progress; it is not yet even a first draft. Please do not treat it as definitive until it has been committed and pushed live to archive.org (wrapped in the appropriate look and feel, etc.). Until then, feel free to reference it, but the official IAS3 documentation can still be found at http://archive.org/help/abouts3.txt.

Introduction

This document covers the technical details of using Internet Archive's S3-like server API, aka "IAS3." The intended audience is a technical user, ideally one who is comfortable in the Linux/UNIX command line environment.

IAS3 is an API based upon Amazon's Simple Storage Service (aka S3). Whereas Amazon's S3 API allows you to store items in the Amazon S3 cloud storage service, the IAS3 API allows you to create items on and upload data to Internet Archive.

Because IAS3 is so similar to Amazon's S3, please familiarize yourself with the Amazon S3 documentation before using Internet Archive's IAS3.

What the IAS3 API Allows You To Do

foo: Check with Sam re: the examples marked below; they aren't working as expected. Also: don't like the section title.

In Internet Archive terminology, an item maps directly onto the Amazon S3 concept of a bucket. IAS3 allows you to create items née buckets, populate them with files and maintain the metadata for the item. You can also use IAS3 to control certain elements of file processing behavior. Internet Archive currently does not support file-level metadata.

Because Internet Archive items are analogous to Amazon S3 buckets they can be accessed using similar URL addresses. Items are typically accessed on Internet Archive using the IA-specific details/IDENTIFIER format. For instance:

http://www.archive.org/details/Sita_Sings_the_Blues

The link above will present the details page for the item on Internet Archive.

This same item is also available in an S3-like format of:

http://s3.us.archive.org/Sita_Sings_the_Blues

Or:

http://Sita_Sings_the_Blues.s3.us.archive.org/

These URLs will return XML containing information about the item.

Each file contained in an item can similarly be used as an S3-like key in a URL:

http://Sita_Sings_the_Blues.s3.us.archive.org/Sita_Sings_the_Blues_small.mp4
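
The URL forms above can be sketched in Python. This is only an illustration: ias3_urls is a hypothetical helper name, not part of any IAS3 client, and the host name is taken from the examples above.

```python
from urllib.parse import quote

S3_HOST = "s3.us.archive.org"  # host used in the examples above


def ias3_urls(identifier, filename=None):
    """Return (path_style, host_style) URLs for an item or one of its files.

    ias3_urls is a hypothetical helper name, not part of any IAS3 client.
    """
    # Percent-encode the file name so spaces and similar characters are safe.
    key = quote(filename) if filename else ""
    path_style = f"http://{S3_HOST}/{identifier}"
    if key:
        path_style += "/" + key
    host_style = f"http://{identifier}.{S3_HOST}/{key}"
    return path_style, host_style


for url in ias3_urls("Sita_Sings_the_Blues", "Sita_Sings_the_Blues_small.mp4"):
    print(url)
```

Omitting the file name yields the two item-level URLs shown earlier.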

Performing a PUT on the Internet Archive equivalent to an S3 endpoint will result in the creation of a new item in Internet Archive. Files may be added to the item in the same manner. Both of these operations may be combined in a single PUT command. For example, using curl:

curl --location --header 'x-amz-auto-make-bucket:1' \
--header 'x-archive-meta01-collection:opensource' \
--header 'x-archive-meta-mediatype:texts' \
--header 'x-archive-meta-sponsor:Andrew W. Mellon Foundation' \
--header 'x-archive-meta-language:eng' \
--header "authorization: LOW $accesskey:$secret" \
--upload-file /home/samuel/public_html/intro-to-k.pdf \
http://s3.us.archive.org/sam-s3-test-08/demo-intro-to-k.pdf
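
For readers who would rather script this than call curl, the same request might be assembled with Python's standard library. This is a minimal sketch: build_put_request is a hypothetical name, the keys and file body are placeholders, and the request is only built, not sent.

```python
import urllib.request


def build_put_request(identifier, key, body, access_key, secret_key, metadata):
    """Build (but do not send) a PUT that creates an item and uploads a file.

    build_put_request is a hypothetical helper written for this sketch.
    """
    headers = {
        "x-amz-auto-make-bucket": "1",
        "authorization": f"LOW {access_key}:{secret_key}",
    }
    # Each metadata field becomes one x-archive-meta-FIELDNAME header.
    for field, value in metadata.items():
        headers[f"x-archive-meta-{field}"] = value
    url = f"http://s3.us.archive.org/{identifier}/{key}"
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


request = build_put_request(
    "sam-s3-test-08", "demo-intro-to-k.pdf", b"placeholder file body",
    "ACCESS", "SECRET", {"mediatype": "texts", "language": "eng"},
)
print(request.get_method(), request.full_url)
```

Sending the request (for example with urllib.request.urlopen) is left to the caller. Note that urllib may not follow redirects for PUT requests the way curl --location does, so a real client may need to handle redirects itself.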

How IAS3 Differs From Amazon S3

IAS3 differs from Amazon's S3 API in several significant ways:

IAS3 also supports several of its own headers. These are discussed in more detail below.

System Requirements

In order to use IAS3 to upload to Internet Archive, you must have:

Using S3 Clients to Access IAS3

Internet Archive strives to make IAS3 compatible with current Amazon S3 client code. Ideally, running the following command on your S3 client code (it replaces amazonaws.com with us.archive.org) would allow you to use IAS3 with no further changes:

perl -pi -e 's/amazonaws.com/us.archive.org/g' *

Some Amazon S3 clients obey configuration files, many of which will allow you to define the preferred S3 hostname. Setting this hostname to s3.us.archive.org in the configuration file should allow the client code to upload to Internet Archive with no further changes.

For instance, adding the following to your ~/.s3cfg configuration file for s3cmd, a popular Amazon S3 client, will allow you to connect to IAS3:

[default]
access_key = YOUR-ACCESS-KEY
secret_key = YOUR-SECRET-KEY
host_base = s3.us.archive.org
host_bucket = %(bucket)s.s3.us.archive.org

Passing Authorization Credentials to IAS3

Authorization credentials may be passed to IAS3 by your Amazon S3-compatible client via configuration file (see above). In addition there is a clear text password mode. To use this mode, pass your access and secret keys as values to the Authorization header:

Authorization: LOW $accesskey:$secret

This is the authorization method shown in most of the examples in this document.
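
As an illustration, the header could be assembled in Python as follows. The function name and the environment variable names (IAS3_ACCESS_KEY, IAS3_SECRET_KEY) are assumptions made for this sketch; IAS3 does not define them.

```python
import os


def low_auth_header(access_key, secret_key):
    """Return the clear-text LOW Authorization header as a dict."""
    if not access_key or not secret_key:
        raise ValueError("both the access key and the secret key are required")
    return {"Authorization": f"LOW {access_key}:{secret_key}"}


# IAS3_ACCESS_KEY and IAS3_SECRET_KEY are assumed names for this sketch only.
access = os.environ.get("IAS3_ACCESS_KEY", "YOUR-ACCESS-KEY")
secret = os.environ.get("IAS3_SECRET_KEY", "YOUR-SECRET-KEY")
headers = low_auth_header(access, secret)
print(sorted(headers))  # print header names only, never the secret itself
```

Because the keys travel in clear text, it is prudent to keep them out of scripts and shell history; reading them from the environment is one way to do that.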

Commonly Used Amazon S3 Headers

foo: are there any more of these? only the one ever appears in the examples

Most Amazon S3 headers can also be used with IAS3. This section briefly discusses the most commonly used Amazon S3 headers.

x-amz-auto-make-bucket

The x-amz-auto-make-bucket header allows you to both create an item and upload directly to it with a single command.

To enable this option, pass the x-amz-auto-make-bucket header with a value of 1. If you do not specify this value you must create an item before you attempt to upload to it. The default value for this header is 0.

This header only works when PUTting to IAS3.

Internet Archive-specific IAS3 Headers

foo: I really don't like the formatting here. Maybe add a standard table to each header, listing where it can be used (PUT/GET/DELETE, etc.), valid values, default value?

Internet Archive has implemented specialized headers for controlling certain operations upon objects and files via IAS3.

x-archive-cascade-delete

Normal DELETE operation is to remove only the specified file. The x-archive-cascade-delete header allows you to delete not only a file but also all derivative and original files associated with it. The Internet Archive derivatives help page provides additional information about the files which may be deleted in this operation.

To enable this option, pass the x-archive-cascade-delete header with a value of 1. The default value for this header is 0.

This header only works when DELETING a file within an item. Nota bene: DELETE is not allowed for items (buckets) in IAS3. You may only DELETE a file and its derivatives.
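
A DELETE request carrying this header might be assembled as below. This is a sketch under stated assumptions: build_delete_request is a hypothetical name, IDENTIFIER and FILENAME.EXT are placeholders, and the request is built but not sent.

```python
import urllib.request


def build_delete_request(identifier, filename, access_key, secret_key,
                         cascade=False):
    """Build (but do not send) a DELETE for one file within an item."""
    if not filename:
        # IAS3 does not allow DELETE on an item (bucket) itself.
        raise ValueError("a file name is required; items cannot be deleted")
    headers = {"authorization": f"LOW {access_key}:{secret_key}"}
    if cascade:
        # Also remove derivative and original files associated with the file.
        headers["x-archive-cascade-delete"] = "1"
    url = f"http://s3.us.archive.org/{identifier}/{filename}"
    return urllib.request.Request(url, headers=headers, method="DELETE")


request = build_delete_request(
    "IDENTIFIER", "FILENAME.EXT", "ACCESS", "SECRET", cascade=True
)
print(request.get_method(), request.full_url)
```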

x-archive-ignore-preexisting-bucket

A normal PUT operation including x-archive-meta-* headers will not overwrite an existing IDENTIFIER_meta.xml file. The x-archive-ignore-preexisting-bucket header will instead overwrite the existing IDENTIFIER_meta.xml file with the x-archive-meta-* header values passed in the same PUT command.

To enable this option, pass the x-archive-ignore-preexisting-bucket header with a value of 1. The default value for this header is 0.

This header only works when PUTting to IAS3.

x-archive-keep-old-version

Normal PUT operation will overwrite a file when it is used to upload a file of the same name. A normal DELETE operation will remove the specified file. The x-archive-keep-old-version header will rename the specified file, prepending the filename with .~~ before proceeding with the PUT or DELETE operation.

To enable this option, pass the x-archive-keep-old-version header with a value of 1. The default value for this header is 0.

Caution! This header is experimental. Its use could result in unexpected results if interleaved with PUTs which do not use this header.

This header works for both PUT and DELETE for IAS3.

x-archive-meta-*

The x-archive-meta-* header is used for setting metadata values for an item. This header is discussed in detail below.

x-archive-queue-derive

Normal operation after a file has been PUT into an item is to queue it for derivation to other file formats. PUTting either a very large file or a large number of files can bog down the derivation process and slow system performance. In these instances it is preferable to disable automatic derive queueing.

Please note: Files may be queued for derivation following upload. To queue an individual file, navigate to the item detail page on Internet Archive and click the Edit Item! link at the top. If you have several files which need to be queued, contact Internet Archive for assistance.

To disable automated creation of derivative files, pass the x-archive-queue-derive header with a value of 0. The default value for this header is 1.

This header works only when PUTting to IAS3.

x-archive-size-hint

If the total size of files in your item will exceed 10 gigabytes, Internet Archive recommends you declare the size at the time of bucket creation. This allows the Internet Archive catalog to more easily place the item for storage, facilitating a potential speed boost to the upload.

To enable this option, pass the x-archive-size-hint header with a value of the file size in bytes. If this header is not defined IAS3 will attempt to default to the value in the content-length header.

This header works only when PUTting to IAS3.
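
One way to compute the value is to add up the sizes of the files you intend to upload. The sketch below assumes the hint should be the combined size of all files destined for the item; size_hint_header is a hypothetical helper, and the sample files are created only for the demonstration.

```python
import os
import tempfile


def size_hint_header(paths):
    """Return an x-archive-size-hint header for the combined size of paths."""
    total_bytes = sum(os.path.getsize(path) for path in paths)
    return {"x-archive-size-hint": str(total_bytes)}


# Demonstration with two small throwaway files (3 bytes and 5 bytes).
with tempfile.TemporaryDirectory() as workdir:
    sample_paths = []
    for name, payload in (("a.bin", b"abc"), ("b.bin", b"hello")):
        path = os.path.join(workdir, name)
        with open(path, "wb") as handle:
            handle.write(payload)
        sample_paths.append(path)
    demo_header = size_hint_header(sample_paths)

print(demo_header)
```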

IAS3 Identifiers

Each item at Internet Archive has an identifier. An identifier is composed of any unique combination of alphanumeric characters, underscore (_) and dash (-). While there are no official limits it is strongly suggested that identifiers be between 5 and 80 characters in length.

Identifiers must be unique across the entirety of Internet Archive, not simply unique within a single collection.
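
The character and length guidance can be checked locally before any upload is attempted. The sketch below is an illustration only: check_identifier is a hypothetical name, "alphanumeric" is assumed to mean ASCII letters and digits, and uniqueness across Internet Archive cannot be verified this way.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits.
IDENTIFIER_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def check_identifier(identifier):
    """Return (is_valid, notes) for a proposed item identifier."""
    notes = []
    is_valid = IDENTIFIER_PATTERN.fullmatch(identifier) is not None
    if not is_valid:
        notes.append("use only alphanumeric characters, underscore and dash")
    if not 5 <= len(identifier) <= 80:
        # This is a suggestion in the text, not a hard limit.
        notes.append("suggested length is between 5 and 80 characters")
    return is_valid, notes


print(check_identifier("vmb_tosw_trial_upload_03"))
print(check_identifier("bad id!"))
```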

Once defined, an identifier cannot be changed. It will travel with the item or object and is involved in every manner of accessing or referring to the item.

In IAS3, identifiers are defined implicitly in the target URL. For example:

curl --location --header 'x-amz-auto-make-bucket:1' \
--header "Authorization: LOW $accesskey:$secret" \
--header "x-archive-meta-collection:test_collection" \
--upload-file /Users/archive/Desktop/The_Open_Source_Way_03.pdf \
http://s3.us.archive.org/vmb_tosw_trial_upload_03/The_Open_Source_Way_03.pdf

The identifier in this command is vmb_tosw_trial_upload_03. The item may be viewed at its details page. The details page for any item is simply http://archive.org/details/ followed by the identifier. The details page for this example is:

http://archive.org/details/vmb_tosw_trial_upload_03

Setting Metadata Values via Headers

The x-archive-meta-* header is used to set metadata values for items. At this time Internet Archive does not support file-level metadata. Metadata may only be defined at an item level.

All metadata fields are defined as key-value pairs passed via headers. The header format is:

x-archive-meta-FIELDNAME:FIELDVALUE

For instance, if you are using curl you may set a value for the title metadata field using this header:

--header "x-archive-meta-title:John Muir on Hetch Hetchy" \

Alternatively, you may use the Amazon S3 standard x-amz-meta-FIELDNAME:FIELDVALUE header for setting metadata.

Metadata headers are sorted prior to processing. This sorting includes the x-amz- or x-archive- header prefixes, therefore if you use both of these prefixes when setting metadata values the fields set with x-amz- will be processed first and may cause unexpected behavior. To avoid potential problems it is advised that you use either the x-archive- or the x-amz- header prefix when setting metadata, not both.

All metadata header values are interpreted as UTF-8 encoded characters.
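
When metadata is held in a program rather than typed on a command line, the headers can be generated from a mapping. A minimal sketch follows; metadata_headers is a hypothetical helper, and it sticks to a single prefix, in line with the advice above.

```python
def metadata_headers(fields, prefix="x-archive-meta-"):
    """Turn a {fieldname: value} mapping into IAS3 metadata headers."""
    headers = {}
    for name, value in fields.items():
        # A colon would end the header name early, so reject it up front.
        if not name or ":" in name:
            raise ValueError(f"invalid metadata field name: {name!r}")
        headers[prefix + name] = str(value)
    return headers


for header, value in metadata_headers(
    {"title": "John Muir on Hetch Hetchy", "mediatype": "texts"}
).items():
    print(f"{header}:{value}")
```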

Standard Internet Archive Metadata Fields

There are several standard metadata fields recognized for Internet Archive items. All metadata fields except identifier are optional.

foo: alphabetize these
foo: standardize wording; it's all over the place
foo: field or tag? Pick a term and stick with it

hidden

foo: what's this do? It's admin/owner-only and doesn't appear on editxml.php

identifier

Each item at Internet Archive has an identifier. An identifier is composed of any unique combination of alphanumeric characters, underscore (_) and dash (-). While there are no official limits it is strongly suggested that identifiers be between 5 and 80 characters in length.

An identifier cannot be defined via metadata header. Instead identifiers are defined implicitly in the target URL. Please see IAS3 Identifiers above for additional information.

title

The title for the item. This appears in the header of the item's detail page on Internet Archive.

If a value is not specified for this field it will default to the identifier for the item.

creator

An entity primarily responsible for creating the files contained in the item.

mediatype

The primary type of media contained in the item. While an item can contain files of diverse mediatypes the value in this field defines the appearance and functionality of the item's detail page on Internet Archive. In particular, the mediatype of an item defines what sort of online viewer is available for the files contained in the item.

The mediatype metadata field recognizes a limited set of values:

  • audio
    The majority of audio items should receive this mediatype value. Items for the Live Music Archive should instead use the etree value.
  • data
    This is the default value for mediatype. Items with a mediatype of data will be available in Internet Archive but you will not be able to browse to them. In addition there will be no online reader/player for the files.
  • etree
    Items which contain files for the Live Music Archive should have a mediatype value of etree. The Live Music Archive has very specific upload requirements. Please consult the documentation for the Live Music Archive prior to creating items for it.
  • image
    Items which predominantly consist of image files should receive a mediatype value of image. Currently these items will not be available for browsing or online viewing in Internet Archive but they will require no additional changes when this mediatype receives additional support in the Archive.
  • movies
    All videos (television, features, shorts, etc.) should receive a mediatype value of movies. These items will be displayed with an online video player.
  • software
    Items with a mediatype of software are accessible to browse via Internet Archive's software collection. There is no online viewer for software but all files are available for download.
  • texts
    Items with a mediatype of texts will appear with the online bookreader. Internet Archive will also attempt to OCR files in these items.
  • web
    The web mediatype value is reserved for items which contain web archive WARC files.

If the mediatype value you set is not in the list above it will be saved but ignored by the system.

This field may be modified only by an administrator or the owner of the item.

If a value is not specified for this field it will default to data.

collection

A collection is a specialized item used for curation and aggregation of other items. Assigning an item to a collection defines where the item may be located by a user browsing Internet Archive. To assign an item to a collection, pass the collection's identifier as the value for an x-archive-meta-collection header. For example, if you are using curl you can assign an item to the Community Texts collection (identifier: opensource) with the following header:

--header 'x-archive-meta-collection:opensource' \

A collection must exist prior to assigning any items to it. Currently collections can only be created by Internet Archive staff members. Please contact Internet Archive if you need a collection created.

description

A description of the item.

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.

date

The publication, production or other similar date of this item. Please use an ISO 8601 compatible format for this date. For instance, these are all valid date formats:

  • YYYY
  • YYYY-MM-DD
  • YYYY-MM-DD HH:MM:SS
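
A date string can be checked against the three formats listed above before it is sent. This sketch covers only those three formats, not the whole of ISO 8601, and matches_listed_format is a hypothetical name.

```python
from datetime import datetime

# strptime equivalents of the three formats listed above.
LISTED_FORMATS = ("%Y", "%Y-%m-%d", "%Y-%m-%d %H:%M:%S")


def matches_listed_format(value):
    """Return True if value is a real date written in a listed format."""
    for fmt in LISTED_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        # Round-trip so that unpadded input such as 2009-5-9 is rejected.
        if parsed.strftime(fmt) == value:
            return True
    return False


for candidate in ("2009", "2009-05-09", "2009-05-09 13:45:00", "05/09/2009"):
    print(candidate, matches_listed_format(candidate))
```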

subject

Keyword(s) or phrase(s) that may be searched for to find your item. Separate each keyword or phrase with a semicolon (";") character. It is helpful but not necessary for you to use Library of Congress Subject Headings for the value of this metadata header.

licenseurl

A URL to the license which covers the works contained in the item.

Internet Archive recommends (but does not require) Creative Commons licensing. Creative Commons provides a license selector for finding the correct license for your needs.

pick

Each collection page on Internet Archive may include a "Staff Picks" section. This section will highlight a single item in the collection. This item will be selected at random from the items with a pick metadata value of 1. If there are no items with this pick metadata value the "Staff Picks" section will not appear on the collection page.

This field may be modified only by an administrator or the owner of the item.

By default all new items have no pick metadata value.

noindex

All items will have their metadata included in the Internet Archive search engine. To disable indexing in the search engine, include a noindex metadata tag. The value of the tag does not matter. Its presence is enough to keep the metadata out of the search engine.

If an item's metadata has already been indexed in the search engine, setting noindex will remove it from the index.

Items whose metadata is not included in the search engine index are not considered "public" per se and therefore will not have a value in the publicdate metadata field (see below).

publicdate

foo: date format accepted?

Items which have had their metadata included in the Internet Archive search engine index are considered to be public. The date the metadata is added to the index is the public date for the item.

This field may be modified only by an administrator or the owner of the item.

While it is possible to set the publicdate metadata value it is not recommended. This value is typically set by automated processes.

addeddate

foo: date format accepted?

The addeddate metadata tag contains the date the item was added to Internet Archive.

While it is possible to set the addeddate metadata value it is not recommended. This value is typically set by automated processes.

adder

foo: pretty sure this value is the username, not the screen name. Screen name is only in the display.

The screen name of the account which added the item to the Internet Archive.

While it is possible to set the adder metadata value it is not recommended. This value is typically set by automated processes.

uploader

The Internet Archive username of the account which uploaded the file(s) to the item.

While it is possible to set the uploader metadata value it is not recommended. This value is typically set by automated processes.

updater

The Internet Archive username of the account which updated the item. This field is repeatable.

This field may be modified only by an administrator or the owner of the item.

While it is possible to set the updater metadata value it is not recommended. This value is typically set by automated processes.

updatedate

foo: date format?

The date on which an update was made to the item. This field is repeatable.

This field may be modified only by an administrator or the owner of the item.

While it is possible to set the updatedate metadata value it is not recommended. This value is typically set by automated processes.

notes

The notes metadata field can contain any information about the item.

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.

rights

The value of the rights metadata field should be a statement of the rights held in and over the item.

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.

contributor

The value of the contributor metadata field is information about the entity responsible for making contributions to the content of the item. This is often the library, organization or individual making the item available on Internet Archive.

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.

publisher

The publisher of the material available in the item.

language

The primary language of the material available in the item.

While the value of the language metadata field can be any value, Internet Archive prefers they be MARC21 Language Codes.

coverage

The extent or scope of the content of the material available in the item. The value of the coverage metadata field may include geographic place, temporal period, jurisdiction, etc. For items which contain multi-volume or serial content, place the statement of holdings in this metadata field.

credits

If known, enter the participants in the production of the materials contained in the item in the credits metadata field.

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed.

Custom Metadata Fields

Internet Archive strives to be metadata agnostic, enabling users to define the metadata format which best suits the needs of their material. In addition to the standard metadata fields listed above you may also define as many custom metadata fields as you require. These metadata fields can be defined ad hoc at item creation or metadata editing time and do not have to be defined in advance. For instance, if your organization uses the PBCORE metadata schema you can include the appropriate metadata fields in your Internet Archive item:

x-archive-meta-pbcoreGenre:Educational
x-archive-meta-pbcoreCoverage:Long Beach, CA
x-archive-meta-pbcoreCoverageType:Spatial
etc.

PLEASE NOTE! RFC 822 disallows the underscore character (_) in HTTP header names. Therefore to use an underscore in the name of a custom metadata field you must replace the underscore (_) with two hyphens (--). These will be translated into an underscore character when the metadata is processed by the server. For example:

x-archive-meta-isbn--10:080652510X

This example will generate a metadata field named isbn_10.
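
The substitution is easy to automate. In the sketch below both function names are hypothetical; the second function only mimics the translation described above and is not Internet Archive's server code.

```python
PREFIX = "x-archive-meta-"


def meta_header_name(field_name):
    """Return the header name for a field, writing each underscore as --."""
    return PREFIX + field_name.replace("_", "--")


def field_name_from_header(header_name):
    """Mimic the described server-side translation of -- back to _."""
    if not header_name.startswith(PREFIX):
        raise ValueError("not an x-archive-meta- header")
    return header_name[len(PREFIX):].replace("--", "_")


print(meta_header_name("isbn_10"))
print(field_name_from_header("x-archive-meta-isbn--10"))
```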

Repeating Metadata Fields

Certain metadata fields such as collection and subject can be repeated. To repeat a metadata header you must sequentially number each instance of the header in your command:

x-archive-meta01-$meta_name:$meta_value_a
x-archive-meta02-$meta_name:$meta_value_b

foo: Need a better example? vaguely recall collections don't need the number but go in order of appearance in the command...?

For example, if an item contains both PDF and mp3 files you may assign it to both the texts and opensource_audio collections by including the following two lines in a curl command:

--header 'x-archive-meta01-collection:texts' \
--header 'x-archive-meta02-collection:opensource_audio' \
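
The numbering can be generated rather than written by hand. This is a sketch only: repeated_meta_headers is a hypothetical helper, and the two-digit, start-at-01 numbering is taken from the examples above; how values beyond 99 should be numbered is not covered here.

```python
def repeated_meta_headers(field_name, values):
    """Number each value of a repeatable field: meta01-, meta02-, and so on."""
    return {
        f"x-archive-meta{index:02d}-{field_name}": value
        for index, value in enumerate(values, start=1)
    }


for header, value in repeated_meta_headers(
    "collection", ["texts", "opensource_audio"]
).items():
    print(f"{header}:{value}")
```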

Setting Metadata Values via Files

While the preferred and recommended method for setting Internet Archive item metadata is via headers, it is possible to provide files containing metadata. If you choose to provide a metadata file instead of using headers it is strongly recommended that the metadata file be the first uploaded during item creation.

When providing a metadata file, please provide only one file per item. It is not necessary to provide a metadata file in each format. Additional files will be generated automatically from the one which you provide.

The valid metadata file formats:

IDENTIFIER_marc.xml

This file must contain metadata in well-formed MARCXML format. It must be named appropriately or it will not be recognized. The proper naming scheme is the item's identifier followed by _marc.xml.

An example IDENTIFIER_marc.xml file can be found in the Appendix.

IDENTIFIER_meta.mrc

This file must contain metadata in binary MARC format according to the ISO 2709 standard. It must be named appropriately or it will not be recognized. The proper naming scheme is the item's identifier followed by _meta.mrc.

An example IDENTIFIER_meta.mrc file can be found in the Appendix.

How These Metadata Files Are Processed

If an IDENTIFIER_meta.mrc file is located it is used to generate an IDENTIFIER_marc.xml file. Any existing IDENTIFIER_marc.xml file will be overwritten by this operation.

The IDENTIFIER_marc.xml file is used to generate an IDENTIFIER_dc.xml file of Dublin Core metadata. Any existing IDENTIFIER_dc.xml file will be overwritten by this operation. The Dublin Core fields are extracted and populated according to the Library of Congress MARC21 to Dublin Core XSL stylesheet. In addition, Internet Archive will extract information from the following MARCXML fields:

The item's definitive metadata file, IDENTIFIER_meta.xml, is generated from the IDENTIFIER_dc.xml Dublin Core file.

Special Files

Each Internet Archive item is comprised of several files. Many of these files are automatically generated and should not be either removed or modified.

IDENTIFIER_meta.xml

IDENTIFIER_meta.xml is the definitive metadata file for the item. It is automatically generated at item creation time using the metadata provided either via headers or via files.

Please do not delete or modify this file. If you must modify the item's metadata, please either use the "Edit Item!" link at the top of its detail page or submit updated metadata via the IAS3 API. See the x-archive-ignore-preexisting-bucket header for additional information about updating an item's metadata via IAS3.

IDENTIFIER_files.xml

IDENTIFIER_files.xml is an auto-generated file cataloging all of the files contained in the item. In addition to the filenames the IDENTIFIER_files.xml file will also list the file format and various hashes for each file. If the file is a derivative IDENTIFIER_files.xml will list the original file from which it was derived.

For example, here is an extract from the japanesefairytal00ozak_files.xml file for Japanese Fairy Tales:

<file name="japanesefairytal00ozak.epub" source="derivative">
  <format>EPUB</format>
  <original>japanesefairytal00ozak_abbyy.gz</original>
  <mtime>1294020233</mtime>
  <size>1230045</size>
  <md5>1d87b3e04ca0b617e041bbcb0cd7f1a5</md5>
  <crc32>7326f3ce</crc32>
  <sha1>5434df04b1b811b03e7d9a32bde3119d3ca924c8</sha1>
</file>
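
An extract like this can be read with Python's standard XML parser. In the sketch below the extract is wrapped in a root element named files so that it parses on its own; that root name is an assumption, and list_files is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The extract above, wrapped in a root element. The root name "files" is an
# assumption made so that the sample parses on its own.
SAMPLE = """<files>
<file name="japanesefairytal00ozak.epub" source="derivative">
  <format>EPUB</format>
  <original>japanesefairytal00ozak_abbyy.gz</original>
  <mtime>1294020233</mtime>
  <size>1230045</size>
  <md5>1d87b3e04ca0b617e041bbcb0cd7f1a5</md5>
  <crc32>7326f3ce</crc32>
  <sha1>5434df04b1b811b03e7d9a32bde3119d3ca924c8</sha1>
</file>
</files>"""


def list_files(xml_text):
    """Return one dict per <file> element, merging attributes and child tags."""
    records = []
    for element in ET.fromstring(xml_text).iter("file"):
        record = dict(element.attrib)
        for child in element:
            record[child.tag] = (child.text or "").strip()
        records.append(record)
    return records


for record in list_files(SAMPLE):
    print(record["name"], record["format"], record.get("original"))
```

The md5 or sha1 value in each record could then be compared with a locally computed hash to confirm that a downloaded copy is intact.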

Please do not delete or modify this file.

IDENTIFIER_rules.conf

Whereas the x-archive-queue-derive header enables or disables queuing files for deriving for the entire item, it is possible to disable the creation of certain derivative files using the IDENTIFIER_rules.conf file. There are three options for selecting which derivative formats to disable via IDENTIFIER_rules.conf:

Specific File Formats

You may disable creation of specific derivative formats by listing them, one format per line, in the IDENTIFIER_rules.conf file. The valid values for file formats can be found in the header rows of the derivatives chart.

For example, to disable creation of h.264 and Ogg Video derivative files, your IDENTIFIER_rules.conf file should contain the following:

h.264
Ogg Video

Only 'lossy' File Formats

Some derivative file formats (mp3, ogg, ogv, mp4, webm) are considered "lossy" because they use a compression algorithm which produces a file which is not identical to the original. To prevent the creation of lossy derivatives, upload an IDENTIFIER_rules.conf file containing this line:

CAT.lossy

All Derivatives

It is possible to disable creation of all derivative files using the IDENTIFIER_rules.conf file. This is equivalent to setting the x-archive-queue-derive header to a value of 0.

To disable all derivatives, upload an IDENTIFIER_rules.conf file containing this line:

CAT.all
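
The three options can be captured in a small generator for the file's contents. This is a sketch: both function names are hypothetical, one option is used per file because the text does not say whether options may be combined, and the trailing newline is an assumption.

```python
def rules_conf_text(disable):
    """Return rules.conf contents for "all", "lossy", or a list of formats."""
    if disable == "all":
        return "CAT.all\n"
    if disable == "lossy":
        return "CAT.lossy\n"
    if isinstance(disable, str):
        raise ValueError('pass "all", "lossy", or a list of format names')
    # Specific formats: one format name per line.
    return "".join(f"{name}\n" for name in disable)


def rules_conf_name(identifier):
    """Return the file name the rules file must have for this item."""
    return f"{identifier}_rules.conf"


print(rules_conf_name("IDENTIFIER"))
print(rules_conf_text(["h.264", "Ogg Video"]), end="")
```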

A new or modified IDENTIFIER_rules.conf file will not be recognized until a new derive process is initiated for the item. It is currently not possible to initiate this process via the IAS3 API. To initiate the derive process in the Internet Archive interface:

  1. Log in to Internet Archive
  2. Navigate to the item's page on Internet Archive
  3. Find the "Edit Item!" link in the upper right of the item
  4. Click the "change the information" link
  5. Click the "Item Manager" link near the top of the page
  6. Click the "derive" button

If now-excluded formats had previously been derived, initiating a derive process will remove the files from the item.

Troubleshooting

Viewing a log of your IAS3 object

Each file uploaded to Internet Archive via IAS3 will have a log file. To view the log, append ?log to the URL of the endpoint. For example:

http://s3.us.archive.org/sam-s3-test-08/demo-intro-to-k.pdf?log

Please note: The log format may change at any time.

My file isn't appearing in the item.

When a file is added to an item it is staged in temporary storage and ingested via the Archive's content management system. While this usually happens very quickly, during periods of heavy system load this process can take a few minutes.

It is also possible that you are viewing a cached version of the item's detail page. Please either clear your web browser's cache or append this parameter to the item's URL:

reCache=1
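
If you add the parameter from a script, take care to preserve any query string already present on the URL. A minimal sketch, in which with_recache is a hypothetical helper:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_recache(url):
    """Return url with reCache=1 added, keeping any existing parameters."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "reCache"  # avoid adding the parameter twice
    ]
    query.append(("reCache", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_recache("http://archive.org/details/vmb_tosw_trial_upload_03"))
```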

Is there a sandbox I can use for testing IAS3?

Internet Archive provides a collection where you can test your item creation and uploads. Items assigned to this collection are removed from the Archive once every thirty days or so. To use this collection, assign your test items to it using this header:

x-archive-meta-collection:test_collection

Please remember that item identifiers must be unique across the entire Archive, including for items in the test collection. Your test scripts may need to be modified to avoid identifier collision once you start creating and uploading to non-test items.
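
One way to reduce the chance of collisions in test scripts is to add a random suffix to each test identifier. The sketch below is an assumption-laden illustration: the ias3-trial prefix and both function names are invented for this example, and a random suffix makes a collision unlikely rather than impossible.

```python
import uuid


def new_trial_identifier(prefix="ias3-trial"):
    """Return an identifier with a random suffix; the prefix is invented."""
    # uuid4().hex is 32 lowercase hexadecimal characters, all alphanumeric.
    return f"{prefix}-{uuid.uuid4().hex}"


def trial_item_headers():
    """Headers that create an item and place it in the test collection."""
    return {
        "x-amz-auto-make-bucket": "1",
        "x-archive-meta-collection": "test_collection",
    }


identifier = new_trial_identifier()
print(identifier, len(identifier))
```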

What happens to my item/file after uploading?

Several processes will operate on your item after it has been created and after each file is added to it. These processes include archiving the content, deriving new files from your originals and backing up the item and its contents. You may view the progress of any of these processes on the item's catalog page:

http://www.archive.org/catalog.php?history=1&identifier=IDENTIFIER

Clicking the task_id for a process will display a detailed log for it.

You may also reach this page by clicking the 'Item History' link on the item's detail page on Internet Archive.

Is there any way to control how files derive?

You may use either the IDENTIFIER_rules.conf file or the x-archive-queue-derive header to control the creation of derivative files from your originals.

Downloading via IAS3

While the IAS3 API supports both GET and HEAD methods for retrieving files, higher performance can be achieved via the Internet Archive web architecture.

Each file in an Internet Archive item can be retrieved via a /download/ link:

http://archive.org/download/IDENTIFIER/FILENAME.EXT
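
A download link can be composed from the identifier and file name. In this sketch download_url is a hypothetical helper; the sample values come from the IDENTIFIER_files.xml extract earlier in this document.

```python
from urllib.parse import quote


def download_url(identifier, filename):
    """Return the /download/ link for one file in an item."""
    # quote() percent-encodes spaces and similar characters in either part.
    return f"http://archive.org/download/{quote(identifier)}/{quote(filename)}"


print(download_url("japanesefairytal00ozak", "japanesefairytal00ozak.epub"))
```

The resulting URL can be passed to any HTTP client, for example urllib.request.urlretrieve.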

This is the recommended method for downloading files from Internet Archive.

Code Examples

curl

Text item (a PDF will be OCR'd):

curl --location --header 'x-amz-auto-make-bucket:1' \
--header 'x-archive-meta01-collection:opensource' \
--header 'x-archive-meta-mediatype:texts' \
--header 'x-archive-meta-sponsor:Andrew W. Mellon Foundation' \
--header 'x-archive-meta-language:eng' \
--header "authorization: LOW $accesskey:$secret" \
--upload-file /home/samuel/public_html/intro-to-k.pdf \
http://s3.us.archive.org/sam-s3-test-08/demo-intro-to-k.pdf

Movie item (Will get video player on details page):

curl --location --header 'x-amz-auto-make-bucket:1' \
--header 'x-archive-meta01-collection:opensource_movies' \
--header 'x-archive-meta-mediatype:movies' \
--header 'x-archive-meta-title:Ben plays piano.' \
--header "authorization: LOW $accesskey:$secret" \
--upload-file ben-2009-05-09.avi \
http://s3.us.archive.org/ben-plays-piano/ben-plays-piano.avi

Uploading a file to an existing item:

curl --location \
--header "authorization: LOW $accesskey:$secret" \
--upload-file /home/samuel/public_html/intro-to-k.pdf \
http://s3.us.archive.org/sam-s3-test-08/demo-intro-to-k.pdf

Destroy and respecify the metadata for an item:

curl --location \
--header 'x-archive-ignore-preexisting-bucket:1' \
--header 'x-archive-meta01-collection:opensource' \
--header 'x-archive-meta-mediatype:texts' \
--header 'x-archive-meta-title:Fancy new title' \
--header "authorization: LOW $accesskey:$secret" \
--upload-file /dev/null \
http://s3.us.archive.org/sam-s3-test-08

A movie example with subject keywords and a Creative Commons license:

curl --location --header 'x-archive-ignore-preexisting-bucket:1' \
--header "authorization: LOW $accesskey:$secret" \
--header 'x-archive-meta-mediatype:movies' \
--header 'x-archive-meta-collection:opensource_movies' \
--header 'x-archive-meta-title:electricsheep-flock-244' \
--header 'x-archive-meta-creator:Scott Draves and the Electric Sheep' \
--header 'x-archive-meta-description:Archive of flock 244 of the Electric Sheep, see http://electricsheep.org and http://scottdraves.com' \
--header 'x-archive-meta-date:2009' \
--header 'x-archive-meta-year:2009' \
--header 'x-archive-meta-subject:electricsheep,alife,art,draves,spotworks,evolution,algorithm' \
--header 'x-archive-meta-licenseurl:http://creativecommons.org/licenses/by-nc/3.0/us/' \
--upload-file /dev/null \
http://s3.us.archive.org/electricsheep-flock-244
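For readers who prefer a scripting language, here is a hedged Python sketch of the first curl example (the text item). It only builds the PUT request and does not send it. The function name build_put_request and the placeholder keys are hypothetical, and redirect handling (curl's --location) is not covered.

```python
import urllib.request

IAS3_BASE = "http://s3.us.archive.org"  # endpoint used by the curl examples


def build_put_request(identifier, filename, body, access_key, secret,
                      extra_headers=None):
    """Build (but do not send) a PUT request shaped like the curl examples."""
    headers = {"authorization": f"LOW {access_key}:{secret}"}
    headers.update(extra_headers or {})
    url = f"{IAS3_BASE}/{identifier}/{filename}"
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


request = build_put_request(
    "sam-s3-test-08",
    "demo-intro-to-k.pdf",
    b"placeholder file contents",
    access_key="ACCESSKEY",  # placeholder, not a real key
    secret="SECRET",         # placeholder, not a real secret
    extra_headers={
        "x-amz-auto-make-bucket": "1",
        "x-archive-meta01-collection": "opensource",
        "x-archive-meta-mediatype": "texts",
    },
)
print(request.get_method(), request.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```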

Perl

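An extract of a script for uploading multiple files via IAS3 using LWP. The constants VERSION, IADLURLBASE and IAS3URLBASE and the helpers metadataHeaders and PUT_FILE are defined elsewhere in the full script: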

my $ua = LWP::UserAgent->new();
$ua->agent('upload_via_IAS3/' . VERSION);
$ua->timeout(20);
$ua->env_proxy;

$ua->default_headers->push_header('authorization'=>"LOW $ias3keys");

# start actual upload tasks, doing some optimization.
# - items with no file to upload are not created
# - item creation is always combined with the first file upload
my @uploadQueue = @{$task->{files}};
while (@uploadQueue) {
    my $file = shift @uploadQueue;
    my $uripath = "/" . $file->{item}{name} . "/" . $file->{filename};
    warn "File: ", $file->{file}, " -> ", $uripath, "\n";
    if (!$forceupload && $file->{uploaded}) {
        my $last = $file->{uploaded};
        # this file was uploaded in a previous run. re-upload it only when
        # something has changed.
        if ($file->{mtime} <= $last->{mtime} &&
            $file->{item}{name} eq $last->{itemName} &&
            $file->{filename} eq $last->{filename}) {
            warn "skipping - no change since last upload\n";
            next;
        }
    }
    if ($checkstore) {
        my $dlurl = IADLURLBASE . $uripath;
        print STDERR "checking ", $dlurl, "...\n" if $verbose;
        my $res = $ua->head($dlurl);
        if ($res->is_success) {
            # file exists - check date (of last upload) against file's mtime
            my $m = $res->header('Date');
            if ($m && str2time($m) >= $file->{mtime}) {
                warn "skipping - upload date later than file's mtime\n";
                next;
            }
        } else {
            # 404 or other failure - upload the file
            print $res->status_line, "\n";
        }
    }
  
    my $waitUntil = $file->{waitUntil};
    if (defined $waitUntil) {
        my $sec = $waitUntil - time();
        while ($sec > 0) {
            print STDERR "holding off $sec second", ($sec > 1 ? 's' : ''), "... ";
            sleep(1);
            $sec--;
        } continue { print STDERR "\r"; }
        print STDERR "\n";
        delete $file->{waitUntil};
    }
    # ok, ready to go
    my $item = $file->{item};
    my @headers = ();
    # prepare item metadata if the item hasn't been created yet (in this
    # session) - it might exist on the server.
    unless ($item->{created}) {
        my $metadata = $item->{metadata};

        # prepare actual HTTP headers for metadata
        push(@headers, 'x-amz-auto-make-bucket', 1);

        # As metadata (most often 'collection' and 'subject') may have multiple
        # values, %$metadata has an array for each metadata name (in some cases,
        # notably 'title', it may be a scalar). If there are in fact multiple
        # values, we use the metadata header in indexed form. If there's only
        # one value (either in an array or as a scalar), we use the basic form.
        # The special metadata 'collection' is also handled by this same logic.
        while (my ($h, $v) = each %$metadata) {
            push(@headers, metadataHeaders($h, $v));
        }

        # add metadata headers for collections the item gets associated with
        my @collectionNames = map($_->{name}, @{$item->{collections}});
        push(@headers, metadataHeaders('collection', \@collectionNames));

        # overwrite existing bucket unless user explicitly told not to.
        unless ($keepExistingMetadata) {
            push(@headers, 'x-archive-ignore-preexisting-bucket', '1');
        }

        # size-hint
        if ($item->{size}) {
            push(@headers, 'x-archive-size-hint', $item->{size});
        }
    }
    # no-derive flag should go with all files
    if ($noDerive) {
        push(@headers, 'x-archive-queue-derive', '0');
    }
    # Expect header
    push(@headers, 'expect', '100-continue');

    my $uri = IAS3URLBASE . $uripath;
    my $content = $file->{path};
    
    if ($verbose) {
        print STDERR "PUT $uri\n";
        for (my $i = 0; $i < $#headers; $i += 2) {
            print STDERR $headers[$i], ":", $headers[$i + 1], "\n";
        }
    }

    if ($dryrun) {
        print STDERR "## dry-run; not making actual request\n";
    } else {
        # use of custom PUT_FILE is for efficient handling of large files.
        # see the comment on PUT_FILE in the full script (not part of this extract).
        my $req = PUT_FILE $uri, $content, @headers;
        #print STDERR $req->as_string;
        my $res = $ua->request($req);
        print STDERR "\n";
        if ($res->is_success) {
            print $res->status_line, "\n";
            $res->headers->scan(sub { print "$_[0]: $_[1]\n"; }) if $verbose;
            print $res->content, "\n" if $verbose;
            print "\n";
        } else {
            print $res->status_line, "\n", $res->content, "\n\n";
            if ($res->code == 503) {
                # Service Unavailable - asking to slow down
                $file->{waitUntil} = time() + 120;
                # put it at the head so that it blocks transfer
                unshift(@uploadQueue, $file);
            } elsif (++$file->{failCount} < 5) {
                $file->{waitUntil} = time() + 120;
                push(@uploadQueue, $file);
            } else {
                # give up
            }
            next;
        }
    }
    
    $item->{created} = 1;
}
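The extract calls a metadataHeaders helper whose body is not shown. The Python sketch below is one guess at what the comments describe: basic form for a single value, indexed form for several. The two-digit index starting at 01 is inferred from the x-archive-meta01-collection header in the curl examples, and the function is an illustration, not the script's actual code.

```python
def metadata_headers(name, value):
    """Return (header, value) pairs for one metadata field.

    A single value (scalar or one-element list) uses the basic form;
    several values use the indexed form, numbered from 01 (assumption
    based on the curl examples).
    """
    values = [value] if isinstance(value, str) else list(value)
    if len(values) == 1:
        return [(f"x-archive-meta-{name}", values[0])]
    return [
        (f"x-archive-meta{index:02d}-{name}", item)
        for index, item in enumerate(values, start=1)
    ]


print(metadata_headers("title", "Fancy new title"))
print(metadata_headers("collection", ["opensource", "opensource_movies"]))
```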

Other Languages

Examples for additional languages are still pending. If you have any you would like to provide, please contact Internet Archive.

Support

For assistance with the IAS3 API, please contact Internet Archive. Please include the string "IAS3 Help" somewhere in the subject line.

Appendices

Terminology

Bucket
'Bucket' is the Amazon S3 term for a container for your files. For the IAS3 API a bucket is equivalent to an Internet Archive item.
Collection
A collection is a specialized Internet Archive item used for aggregating related collections and items. An item or collection is assigned to a collection via the x-archive-meta-collection metadata header.
Derivative
A derivative is a file which Internet Archive will automatically generate from the original file which you provide. Derivatives enable as many people as possible to access the file while also protecting against file format obsolescence. Please refer to this chart to see which derivative formats Internet Archive will produce.
Identifier
Each item at Internet Archive has an identifier. An identifier is composed of any unique combination of alphanumeric characters, underscore (_) and dash (-). While there are no official limits, it is strongly suggested that identifiers be between 5 and 80 characters in length. An identifier must be unique across the entirety of Internet Archive.
Item
An item is the primary entity of the Internet Archive. All of the files you upload will be contained in items. Each item has its own Internet Archive page, also known as its details page. The details page can be accessed using the following URL pattern:
http://archive.org/details/IDENTIFIER
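The identifier rules in the Terminology list above can be checked locally before an upload. The sketch below is a hypothetical helper, not part of IAS3. It assumes "alphanumeric" means ASCII letters and digits, and it cannot check uniqueness, which only Internet Archive can determine.

```python
import re

# Assumption: "alphanumeric" is read as ASCII letters and digits.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")


def check_identifier(identifier):
    """Return a list of problems found; an empty list means none."""
    problems = []
    if not ALLOWED.fullmatch(identifier):
        problems.append("contains characters other than letters, digits, '_' and '-'")
    if not 5 <= len(identifier) <= 80:
        problems.append("length is outside the suggested 5 to 80 characters")
    return problems


print(check_identifier("Sita_Sings_the_Blues"))
print(check_identifier("a b"))
```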

Internet Archive's Item Structure (in brief)

foo: Fill in this section

Each item has a details page and its own metadata.

An item can contain several files, which are listed via the HTTP link on its details page.

For more information on Internet Archive's item structure, see:
http://www.archive.org/about/faqs.php

You can also look at an item's structure directly by clicking the HTTP link shown on a details page, for example: http://archive.org/details/stats

IAS3 HTTP Return Codes

The IAS3 API may return the following HTTP Return Codes:

HTTP Return Code | Meaning
102 | Processing
200 | OK
201 | Created
204 | No Content
207 | Multi-Status
400 | Bad Request
403 | Forbidden
404 | Not Found
405 | Method Not Allowed
409 | Conflict
412 | Precondition Failed
415 | Unsupported Media Type
422 | Unprocessable Entity
423 | Locked
424 | Failed Dependency
502 | Bad Gateway
507 | Insufficient Storage

Error Messages

IAS3 may return the following error messages:

Error Code | Error Message | HTTP Code Returned
AccessDenied | Access Denied | 403 Forbidden
AccountProblem | There is a problem with your AWS account that prevents the operation from completing successfully. Please contact customer service at webservices@amazon.com. | 403 Forbidden
AmbiguousGrantByEmailAddress | The e-mail address you provided is associated with more than one account. | 400 Bad Request
BadDigest | The Content-MD5 you specified did not match what we received. | 400 Bad Request
BucketAlreadyExists | The requested bucket name is not available. The bucket namespace is shared by all users of the system. Please select a different name and try again. | 409 Conflict
BucketAlreadyOwnedByYou | Your previous request to create the named bucket succeeded and you already own it. | 409 Conflict
BucketNotEmpty | The bucket you tried to delete is not empty. | 409 Conflict
CredentialsNotSupported | This request does not support credentials. | 400 Bad Request
CrossLocationLoggingProhibited | Cross location logging not allowed. Buckets in one geographic location cannot log information to a bucket in another location. | 403 Forbidden
EntityTooSmall | Your proposed upload is smaller than the minimum allowed object size. | 400 Bad Request
EntityTooLarge | Your proposed upload exceeds the maximum allowed object size. | 400 Bad Request
ExpiredToken | The provided token has expired. | 400 Bad Request
IncompleteBody | You did not provide the number of bytes specified by the Content-Length HTTP header. | 400 Bad Request
IncorrectNumberOfFilesInPostRequest | POST requires exactly one file upload per request. | 400 Bad Request
InlineDataTooLarge | Inline data exceeds the maximum allowed size. | 400 Bad Request
InternalError | We encountered an internal error. Please try again. | 500 Internal Server Error
InvalidAccessKeyId | The AWS Access Key Id you provided does not exist in our records. | 403 Forbidden
InvalidAddressingHeader | You must specify the Anonymous role. | N/A
InvalidArgument | Invalid Argument | 400 Bad Request
InvalidBucketName | The specified bucket is not valid. | 400 Bad Request
InvalidDigest | The Content-MD5 you specified was invalid. | 400 Bad Request
InvalidLocationConstraint | The specified location constraint is not valid. | 400 Bad Request
InvalidPayer | All access to this object has been disabled. | 403 Forbidden
InvalidPolicyDocument | The content of the form does not meet the conditions specified in the policy document. | 400 Bad Request
InvalidRange | The requested range cannot be satisfied. | 416 Requested Range Not Satisfiable
InvalidSecurity | The provided security credentials are not valid. | 403 Forbidden
InvalidSOAPRequest | The SOAP request body is invalid. | 400 Bad Request
InvalidStorageClass | The storage class you specified is not valid. | 400 Bad Request
InvalidTargetBucketForLogging | The target bucket for logging does not exist, is not owned by you, or does not have the appropriate grants for the log-delivery group. | 400 Bad Request
InvalidToken | The provided token is malformed or otherwise invalid. | 400 Bad Request
InvalidURI | Couldn't parse the specified URI. | 400 Bad Request
KeyTooLong | Your key is too long. | 400 Bad Request
MalformedACLError | The XML you provided was not well-formed or did not validate against our published schema. | 400 Bad Request
MalformedPOSTRequest | The body of your POST request is not well-formed multipart/form-data. | 400 Bad Request
MaxMessageLengthExceeded | Your request was too big. | 400 Bad Request
MaxPostPreDataLengthExceededError | Your POST request fields preceding the upload file were too large. | 400 Bad Request
MetadataTooLarge | Your metadata headers exceed the maximum allowed metadata size. | 400 Bad Request
MethodNotAllowed | The specified method is not allowed against this resource. | 405 Method Not Allowed
MissingAttachment | A SOAP attachment was expected, but none were found. | N/A
MissingContentLength | You must provide the Content-Length HTTP header. | 411 Length Required
MissingSecurityElement | The SOAP 1.1 request is missing a security element. | 400 Bad Request
MissingSecurityHeader | Your request was missing a required header. | 400 Bad Request
NoLoggingStatusForKey | There is no such thing as a logging status sub-resource for a key. | 400 Bad Request
NoSuchBucket | The specified bucket does not exist. | 404 Not Found
NoSuchKey | The specified key does not exist. | 404 Not Found
NotImplemented | A header you provided implies functionality that is not implemented. | 501 Not Implemented
NotSignedUp | Your account is not signed up for the Amazon S3 service. You must sign up before you can use Amazon S3. You can sign up at the following URL: http://aws.amazon.com/s3 | 403 Forbidden
OperationAborted | A conflicting conditional operation is currently in progress against this resource. Please try again. | 409 Conflict
PermanentRedirect | The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint. | 301 Moved Permanently
PreconditionFailed | At least one of the pre-conditions you specified did not hold. | 412 Precondition Failed
Redirect | Temporary redirect. | 307 Moved Temporarily
RequestIsNotMultiPartContent | Bucket POST must be of the enclosure-type multipart/form-data. | 400 Bad Request
RequestTimeout | Your socket connection to the server was not read from or written to within the timeout period. | 400 Bad Request
RequestTimeTooSkewed | The difference between the request time and the server's time is too large. | 403 Forbidden
RequestTorrentOfBucketError | Requesting the torrent file of a bucket is not permitted. | 400 Bad Request
SignatureDoesNotMatch | The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. For more information, see Authenticating REST Requests and Authenticating SOAP Requests. | 403 Forbidden
SlowDown | Please reduce your request rate. | 503 Service Unavailable
TemporaryRedirect | You are being redirected to the bucket while DNS updates. | 307 Moved Temporarily
TokenRefreshRequired | The provided token must be refreshed. | 400 Bad Request
TooManyBuckets | You have attempted to create more buckets than allowed. | 400 Bad Request
UnexpectedContent | This request does not support content. | 400 Bad Request
UnresolvableGrantByEmailAddress | The e-mail address you provided does not match any account on record. | 400 Bad Request
UserKeyMustBeSpecified | The bucket POST must contain the specified field name. If it is specified, please check the order of the fields. | 400 Bad Request
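The SlowDown entry (503 Service Unavailable) is the one error the Perl extract treats specially. The sketch below restates that retry policy in Python so it can be read on its own. The 120 second delay and the limit of 5 failures come from the Perl extract, while the function and field names are hypothetical.

```python
from collections import deque

RETRY_DELAY = 120  # seconds, as in the Perl extract
MAX_FAILURES = 5   # failures allowed before giving up, as in the Perl extract


def reschedule(queue, item, status, now):
    """Requeue a failed upload. Returns False when the item is dropped."""
    if status == 503:
        # Server asks to slow down: retry this item first, after a pause,
        # so that it blocks the rest of the queue.
        item["wait_until"] = now + RETRY_DELAY
        queue.appendleft(item)
        return True
    item["fail_count"] = item.get("fail_count", 0) + 1
    if item["fail_count"] < MAX_FAILURES:
        # Other failure: retry later, behind the remaining uploads.
        item["wait_until"] = now + RETRY_DELAY
        queue.append(item)
        return True
    return False  # give up


queue = deque([{"name": "b.pdf"}])
reschedule(queue, {"name": "a.pdf"}, 503, now=1000)
print([entry["name"] for entry in queue])
```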

Default Metadata Values

Several metadata fields will receive default values if none is specified at item creation time:

Field | Default Value
uploader | The username of the Internet Archive patron used to create the item.
mediatype | data
collection | opensource
title | The identifier specified for the item.
addeddate | The current date and time formatted as YYYY-mm-dd hh:mm:ss
publicdate | The current date and time formatted as YYYY-mm-dd hh:mm:ss

Other metadata fields will not be added to the item unless explicitly specified.
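To preview what an item's metadata would look like after these defaults are applied, a client-side sketch such as the following may help. It only mirrors the defaults listed above. The function name with_defaults is hypothetical, and the use of UTC for the timestamps is an assumption, since this document does not state a time zone.

```python
from datetime import datetime, timezone


def with_defaults(identifier, uploader, metadata, now=None):
    """Return metadata with the documented defaults filled in."""
    # Assumption: timestamps are taken in UTC.
    moment = now or datetime.now(timezone.utc)
    stamp = moment.strftime("%Y-%m-%d %H:%M:%S")
    defaults = {
        "uploader": uploader,
        "mediatype": "data",
        "collection": "opensource",
        "title": identifier,
        "addeddate": stamp,
        "publicdate": stamp,
    }
    # Explicitly specified fields win over the defaults.
    return {**defaults, **metadata}


preview = with_defaults(
    "sam-s3-test-08",
    "example-patron",  # placeholder username
    {"mediatype": "texts"},
    now=datetime(2011, 10, 6, 12, 0, 0),
)
print(preview)
```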

Example IDENTIFIER_marc.xml file

The japanesefairytal00ozak_marc.xml file for Japanese Fairy Tales:

<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00871nam a2200253 4500</leader>
  <controlfield tag="001">ocm15627400</controlfield>
  <controlfield tag="005">20060625113632.0</controlfield>
  <controlfield tag="008">900121s1903 nyua j 000 0 eng d</controlfield>
  <datafield tag="035" ind1="" ind2="">
    <subfield code="a">902182803</subfield>
  </datafield>
  <datafield tag="040" ind1="" ind2="">
    <subfield code="a">CLO</subfield>
    <subfield code="c">CLO</subfield>
    <subfield code="d">m/c</subfield>
    <subfield code="d">BNY</subfield>
    <subfield code="d">UtOrBLW</subfield>
  </datafield>
  <datafield tag="041" ind1="1" ind2="">
    <subfield code="a">engeng</subfield>
  </datafield>
  <datafield tag="043" ind1="" ind2="">
    <subfield code="a">a-ja---</subfield>
  </datafield>
  <datafield tag="091" ind1="" ind2="">
    <subfield code="p">J</subfield>
    <subfield code="a">398</subfield>
    <subfield code="c">O</subfield>
  </datafield>
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">Japanese fairy tales /</subfield>
    <subfield code="c">compiled by Yei Theodora Ozaki ; profusely
    illustrated by Japanese artists.</subfield>
  </datafield>
  <datafield tag="260" ind1="" ind2="">
    <subfield code="a">New York :</subfield>
    <subfield code="b">Grosset &amp; Dunlap,</subfield>
    <subfield code="c">[preface 1903]</subfield>
  </datafield>
  <datafield tag="300" ind1="" ind2="">
    <subfield code="a">vii, 305 p., [1] leaf of plates :</subfield>
    <subfield code="b">ill. ;</subfield>
    <subfield code="c">22 cm.</subfield>
  </datafield>
  <datafield tag="590" ind1="" ind2="">
    <subfield code="a">NY3</subfield>
  </datafield>
  <datafield tag="650" ind1="" ind2="0">
    <subfield code="a">Fairy tales</subfield>
    <subfield code="z">Japan.</subfield>
  </datafield>
  <datafield tag="650" ind1="" ind2="0">
    <subfield code="a">Folklore</subfield>
    <subfield code="z">Japan.</subfield>
  </datafield>
  <datafield tag="700" ind1="1" ind2="">
    <subfield code="a">Ozaki, Yei Theodora.</subfield>
  </datafield>
  <datafield tag="923" ind1="" ind2="">
    <subfield code="a">j</subfield>
  </datafield>
  <datafield tag="995" ind1="" ind2="">
    <subfield code="a">59521</subfield>
  </datafield>
  <datafield tag="920" ind1="" ind2="">
    <subfield code="a">Donnell Library Center</subfield>
    <subfield code="b">J 398 O</subfield>
    <subfield code="c">checked Out</subfield>
    <subfield code="d">Children's Room Stacks</subfield>
    <subfield code="r">A</subfield>
    <subfield code="z">DLC</subfield>
  </datafield>
  <datafield tag="920" ind1="" ind2="">
    <subfield code="a">96th Street Branch</subfield>
    <subfield code="b">J 398 O</subfield>
    <subfield code="c">checked In</subfield>
    <subfield code="d">CR Reading Room Collection</subfield>
    <subfield code="r">A</subfield>
    <subfield code="z">NSR</subfield>
  </datafield>
</record>

The original file may be viewed here.
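As a hedged illustration of reading such a file, the sketch below pulls subfields out of a trimmed copy of the record using Python's standard xml.etree.ElementTree module. The helper name subfields is hypothetical, and the embedded sample keeps only a few of the datafields shown above.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <record> element of the MARCXML file.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">Japanese fairy tales /</subfield>
  </datafield>
  <datafield tag="650" ind1="" ind2="0">
    <subfield code="a">Fairy tales</subfield>
    <subfield code="z">Japan.</subfield>
  </datafield>
  <datafield tag="650" ind1="" ind2="0">
    <subfield code="a">Folklore</subfield>
    <subfield code="z">Japan.</subfield>
  </datafield>
</record>"""


def subfields(record, tag, code):
    """Return the text of every matching subfield, in document order."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [element.text for element in record.findall(path, NS)]


record = ET.fromstring(SAMPLE)
print(subfields(record, "245", "a"))  # title statement
print(subfields(record, "650", "a"))  # subject headings
```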

Example IDENTIFIER_meta.mrc file

The japanesefairytal00ozak_meta.mrc file for Japanese Fairy Tales (line breaks added and control characters converted to ASCII representations for readability):

00871nam  2200253   450000100130000000500170001300800410003003500140007104000320008504
1001100117043001200128091001400140245010400154260005000258300005400308590000800362650
002400370650002100394700002500415923000600440995001000446920008100456920008000537^^oc
m15627400 ^^20060625113632.0^^900121s1903    nyua   j      000 0 eng d^^  ^_a90218280
3^^  ^_aCLO^_cCLO^_dm/c^_dBNY^_dUtOrBLW^^ ^_aengeng^^  ^_aa-ja---^^  ^_pJ^_a398^_cO^^
0^_aJapanese fairy tales /^_ccompiled by Yei Theodora Ozaki ; profusely illustrated b
y Japanese artists.^^  ^_aNew York :^_bGrosset & Dunlap,^_c[preface 1903]^^  ^_avii, 
305 p., [1] leaf of plates :^_bill. ;^_c22 cm.^^  ^_aNY3^^ 0^_aFairy tales^_zJapan.^^
 0^_aFolklore^_zJapan.^^1 ^_aOzaki, Yei Theodora.^^  ^_aj^^  ^_a59521^^  ^_aDonnell L
ibrary Center^_bJ 398 O^_cchecked Out^_dChildren's Room Stacks^_rA^_zDLC^^  ^_a96th S
treet Branch^_bJ 398 O^_cchecked In^_dCR Reading Room Collection^_rA^_zNSR^^^]

The original file may be viewed here.
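The opening digits of the record above follow the standard MARC layout: a 24-character leader, then a directory of 12-character entries (3-character tag, 4-digit field length, 5-digit starting offset). The sketch below decodes the leader and the first three directory entries copied from the example. The function names are hypothetical, and the sketch does not attempt to read the field data itself.

```python
def parse_leader(leader):
    """Extract the record length and the base address of data from a leader."""
    return {
        "record_length": int(leader[0:5]),
        "base_address": int(leader[12:17]),
    }


def parse_directory(directory):
    """Split a directory string into (tag, field_length, start) entries."""
    entries = []
    usable = len(directory) - len(directory) % 12  # ignore a partial entry
    for offset in range(0, usable, 12):
        chunk = directory[offset:offset + 12]
        entries.append((chunk[0:3], int(chunk[3:7]), int(chunk[7:12])))
    return entries


# Leader and first three directory entries from the example record.
LEADER = "00871nam  2200253   4500"
DIRECTORY = "001001300000" "005001700013" "008004100030"

print(parse_leader(LEADER))
print(parse_directory(DIRECTORY))
```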